Abstract

The paradigm for NLP known as statistical language learning (SLL) has flourished in recent
times, being seen as a quick and easy way to get off the ground. Research systems have been
launched at many NLP problems including sense disambiguation (Yarowsky, 1992), anaphora
resolution (Dagan and Itai, 1990), prepositional phrase attachment (Hindle and Rooth, 1993)
and lexical acquisition (Brent, 1993). This has all been fueled by the large text corpora which
are increasingly available (Marcus et al., 1993). Since these systems learn to navigate
language by consuming text, they are critically dependent on the data that drives them.

In this paper I address the practical concern of predicting how much training data is 
sufficient for a given system. First, I briefly review earlier results and show how these can 
be combined to bound the expected accuracy of a mode-based learner as a function of the 
volume of training data. I then develop a more accurate estimate of the expected accuracy 
function under the assumption that inputs are uniformly distributed. Since this estimate 
is expensive to compute, I also give a close but cheaply computable approximation to 
it. Finally, I report on a series of simulations exploring the effects of inputs that are not 
uniformly distributed. 

1 Background 

1.1 Do We Need To Know? 

Even though text is becoming increasingly available, it is often expensive, especially if it 
must be annotated. Consider the decisions facing the SLL technology consumer, that is, the 
architect of a planned commercial NLP system. For each module which is to employ SLL, an
appropriate technique must be selected. If different techniques require different amounts of 
data to achieve a given accuracy, the architect would like to know what these requirements 
are in advance in order to make an informed choice. 

Further, once the technique is chosen, she must decide how much data to collect or purchase
for training. Because this data can be expensive, foreknowledge of data requirements
is highly valuable. Thus, in order to make statistical NLP technology practical, a predictive
theory of data requirements is needed. Despite this need, very little attention has been paid
to the problem.^1

*This paper has been accepted for publication at the Eighth Australian Joint Conference on
Artificial Intelligence, Canberra, 1995.

^1 See de Haan (1992) for an investigation of sample sizes for linguistic studies.



1.2 Foundations For A Theory 

All the SLL systems mentioned above employ knowledge gained from a corpus to make decisions.
Abstractly, this knowledge can be represented as a mapping from observable features
(inputs) to decision outcomes (outputs). Following Lauer (1995) I will call each distinguished
input a bin and each possible output a value. There is a probability distribution across
the bins representing how instances fall into bins. Also, for each bin, there is a probability 
distribution across the set of values representing how instances in that bin take on values. 
For the system to perform accurately, most (but not necessarily all) of the instances falling 
in a particular bin must have the same value. 

In what follows I will make several assumptions. Training and test data are drawn from the
same distributions. The set of possible values is binary (examples include Hindle and Rooth,
1993 and Lauer, 1994). The probability of the most likely value in each bin is constant.^2
Finally, I will only consider a simple learning algorithm: collect the training instances falling 
into each bin and then select the most frequent value for each. This mode-based learner is 
employed directly in the unigram tagger of Charniak (1993, p49) and is at the heart of many 
systems. 
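
The mode-based learner described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in any of the cited systems: the function names `train_mode_learner` and `predict` are hypothetical, bins and values are assumed to be arbitrary hashable objects, and a bin with no training instances is assumed to be resolved by a uniform random guess. Ties inside a bin are broken arbitrarily here (by first occurrence), whereas the analysis later in the paper assumes a random guess.

```python
import random
from collections import Counter, defaultdict


def train_mode_learner(instances):
    """Collect (bin, value) training pairs and keep the most frequent value per bin."""
    counts = defaultdict(Counter)
    for b, v in instances:
        counts[b][v] += 1
    # most_common(1) returns the mode; ties are broken by first occurrence.
    return {b: c.most_common(1)[0][0] for b, c in counts.items()}


def predict(model, b, values=(0, 1), rng=random):
    """Return the learned value for bin b, or a random guess for an empty bin."""
    if b in model:
        return model[b]
    return rng.choice(values)
```

For example, training on `[("a", 1), ("a", 1), ("a", 0), ("b", 0)]` assigns value 1 to bin `"a"` and value 0 to bin `"b"`; any other bin is guessed.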

1.3 Optimal Accuracy 

There are two sources of error in statistical language learners of the kind we are considering. 
First, since the values are not necessarily fully determined by the bins, no matter what value 
the learner assigns to a bin there will always be errors (the optimal error rate). Second, since 
training data is limited, the learner may not have sufficient data available to acquire accurate 
rules. The combination of these sources of error results in some degree of inaccuracy for the 
system. We are interested in estimating the accuracy for various volumes of training data. 
Since the optimal error rate is independent of the amount of training data, it will always exist 
no matter how much data is used. As the amount of training data increases we expect the 
accuracy to get closer to this optimal. 

Let $B$ be the set of bins, $V$ the set of values, $\Pr(b)$ the probability that an instance falls
into the bin $b$ and $\Pr(v \mid b)$ the probability of the value $v$ given the bin $b$. If we denote the
most likely value in each bin as $v_b = \arg\max_{v \in V} \Pr(v \mid b)$, then the expected value of the
optimal accuracy is determined by the likelihood of this value occurring in each bin.

\[ \mathrm{OA} = \sum_{b \in B} \Pr(b) \Pr(v_b \mid b) \tag{1} \]

If we know the probability that an algorithm will learn the value $v$ for the bin $b$ (denote
this $\Pr(\mathrm{learn}(b) = v)$), then we can also calculate the expected accuracy rate:

\[ \mathrm{EA} = \sum_{b \in B} \Pr(b) \sum_{v \in V} \Pr(\mathrm{learn}(b) = v) \Pr(v \mid b) \tag{2} \]

In Lauer (1995) several results are shown concerning the relationship of these two values. I
will summarise these in section 2.1 (see equations (3) and (4)).
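
To make equations (1) and (2) concrete, the following sketch evaluates both sums for distributions given as plain dictionaries. The data layout is an assumption made for illustration: `bin_probs[b]` holds Pr(b), `value_probs[b][v]` holds Pr(v | b), and `learn_probs[b][v]` holds Pr(learn(b) = v). None of these names come from the paper.

```python
def optimal_accuracy(bin_probs, value_probs):
    """Equation (1): sum over bins of Pr(b) times the probability of the bin's best value."""
    return sum(bin_probs[b] * max(value_probs[b].values()) for b in bin_probs)


def expected_accuracy(bin_probs, value_probs, learn_probs):
    """Equation (2): weight Pr(v | b) by the chance that the learner picks v for bin b."""
    return sum(
        bin_probs[b]
        * sum(learn_probs[b][v] * value_probs[b][v] for v in value_probs[b])
        for b in bin_probs
    )
```

A learner that always picks the most likely value in every bin attains EA = OA, while a learner that picks between two values by a fair coin in every bin attains EA = 0.5.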

^2 Note that this does not require that the most likely value be the same value in each bin; only
that whatever the most likely value is has a constant probability.
2 Existing Work



2.1 Empty Bins and Non-empty Bins 

The most severe result of insufficient training data is that some bins can go without any 
training instances. Since the learner has no indications about likely values for the bin it 
will be forced to guess. To estimate how often this will occur, consider the way in which 
m training instances would fall into the bins. For each bin, the probability that no training 
instances fall into it is: 

\[ \Pr(\mathrm{count}(b) = 0) = (1 - \Pr(b))^{m} \]

I will call such bins empty bins.

In Lauer (1995) it is shown that for any bin b: 

\[ \Pr(\mathrm{count}(b) = 0) \le e^{-m/|B|} \tag{3} \]

Lauer (1995) also bounds the expected accuracy of the mode-based learner when all bins 
are guaranteed to have at least one training instance. When this is the case, it is shown that 
the expected error rate is always no worse than twice the optimal error rate. 

\[ \mathrm{EA} \ge 1 - 2(1 - \mathrm{OA}) \tag{4} \]

This is quite a useful result, since we expect the optimal accuracy to be fairly high. If 
the optimal predictions are 90% accurate, then a mode-based learner will be at least 80% 
accurate after learning on just one instance per bin. 
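
The two results above are easy to check numerically. The sketch below is illustrative only: the function names are hypothetical, and the comparison between the exact empty-bin probability and the exponential bound is made for uniformly probable bins, which is an assumption of this example.

```python
import math


def prob_empty(bin_prob, m):
    """Exact probability that none of m training instances falls into a bin."""
    return (1.0 - bin_prob) ** m


def empty_bound(m, num_bins):
    """Exponential bound of equation (3): e^(-m / |B|)."""
    return math.exp(-m / num_bins)


def nonempty_accuracy_bound(optimal_accuracy):
    """Equation (4): lower bound on expected accuracy when no bin is empty."""
    return 1.0 - 2.0 * (1.0 - optimal_accuracy)
```

With 100 uniform bins and 300 training instances, `prob_empty(1 / 100, 300)` stays below `empty_bound(300, 100)`, and `nonempty_accuracy_bound(0.9)` reproduces the 80% figure quoted in the text.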

2.2 Overall Expected Accuracy 

Unfortunately, we cannot normally guarantee that no bins will be empty, since the corpus is
typically a random sample. However, we can combine equations (3) and (4) to arrive at a
bound for the overall expected accuracy after training on a random sample. Over non-empty
bins, we know that the error rate is no worse than twice the optimal error rate for those bins.
Since we have assumed that $\Pr(v_b \mid b)$ is constant (call this $p$), we can infer that the optimal
accuracy for the non-empty bins is the same as the optimal accuracy on all bins. Thus:

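\[
\begin{aligned}
\mathrm{EA} &= \Pr(\text{non-empty})\,\mathrm{EA}(\text{non-empty}) + \Pr(\text{empty})\,\mathrm{EA}(\text{empty}) \\
&\ge (1 - e^{-m/|B|})\,\mathrm{EA}(\text{non-empty}) + e^{-m/|B|}\,\mathrm{EA}(\text{empty}) \\
&\ge (1 - e^{-m/|B|})(1 - 2(1 - \mathrm{OA})) + \tfrac{1}{2} e^{-m/|B|} \\
&= (1 - e^{-m/|B|})(2p - 1) + \tfrac{1}{2} e^{-m/|B|}
\end{aligned}
\tag{5}
\]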

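The second step follows from the fact that $\mathrm{EA}(\text{non-empty}) \ge \mathrm{EA}(\text{empty})$ and equation (3).
The third step follows from equation (4), together with the fact that a random guess between
two values on an empty bin is correct half the time.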
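
The bound in equation (5) is cheap to evaluate. The following sketch (with the hypothetical name `old_accuracy_bound`) computes it for a given number of training instances `m`, number of bins `num_bins` and optimal value probability `p`.

```python
import math


def old_accuracy_bound(m, num_bins, p):
    """Equation (5): lower bound on expected accuracy after m random training instances."""
    empty = math.exp(-m / num_bins)  # bound on the chance that a bin is empty
    return (1.0 - empty) * (2.0 * p - 1.0) + 0.5 * empty
```

With no training data the bound is 0.5 (pure guessing), and as `m` grows it approaches 2p - 1, that is, an error rate twice the optimal one.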

3 Theory 

3.1 Estimating Expected Accuracy 

Given the assumptions in section 1.2, we can arrive at a better estimate of the expected
accuracy when the distribution of bins is uniform (that is, $\Pr(b) = 1/|B|$). Let the total number
of training instances in a bin $b$ be $n$ and the number of these instances with value $v$ be
$\mathrm{count}(v, b)$:



\[ \Pr(\mathrm{count}(v, b) > n/2) = \sum_{i=\lceil (n+1)/2 \rceil}^{n} \binom{n}{i} \Pr(v \mid b)^{i} (1 - \Pr(v \mid b))^{n-i} \]



If $n$ is even, we must also add an additional term of $\tfrac{1}{2} \binom{n}{n/2} \Pr(v \mid b)^{n/2} (1 - \Pr(v \mid b))^{n/2}$.
This is because when there are equal numbers of both values in the bin, a random guess yields 
an expected accuracy of 50%. In the arguments below, I will treat all values of n as odd in 
order to simplify. The reader may check for herself that the results hold generally when the 
above extra term is included. 
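
As a check on the role of the extra term, the sketch below computes the probability that a value with per-instance probability `q` is the one selected by the mode-based learner from `n` training instances, including the tie term for even `n`. The function name `prob_selected` is hypothetical, and an empty bin (`n = 0`) is treated as a tie, which is an assumption consistent with random guessing.

```python
from math import comb


def prob_selected(n, q):
    """Probability that a value of probability q wins the vote among n instances."""
    # Strict majority: more than n/2 of the instances carry the value.
    strict = sum(comb(n, i) * q**i * (1 - q) ** (n - i) for i in range(n // 2 + 1, n + 1))
    # Even n: an exact tie is resolved by a fair coin, hence the factor 1/2.
    tie = 0.5 * comb(n, n // 2) * (q * (1 - q)) ** (n // 2) if n % 2 == 0 else 0.0
    return strict + tie
```

With the tie term included, `prob_selected(n, q) + prob_selected(n, 1 - q)` equals 1 for every `n`, which is the property that the odd-`n` simplification relies on.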

Using the fact that V is binary, the total expected accuracy for test instances in bin b 
when it contains n training instances is: 



\[ \Pr(v = \arg\max_{v' \in V} \mathrm{count}(v', b)) = \sum_{v \in V} \Pr(v \mid b) \sum_{i=(n+1)/2}^{n} \binom{n}{i} \Pr(v \mid b)^{i} (1 - \Pr(v \mid b))^{n-i} \]



By summing over all possible numbers of training instances in a bin, we can arrive at an 
expression for the expected accuracy across all bins as follows: 

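\[ \mathrm{EA} = \sum_{b \in B} \Pr(b) \sum_{n=0}^{m} \mathrm{binomial}(n; m, \Pr(b)) \sum_{v \in V} \Pr(v \mid b) \sum_{i=(n+1)/2}^{n} \mathrm{binomial}(i; n, \Pr(v \mid b)) \]

where $\mathrm{binomial}(i; k, p) = \binom{k}{i} p^{i} (1 - p)^{k-i}$.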

To simplify this I have defined a function as follows: 

\[ G(m, r, p) = \sum_{n=0}^{m} \mathrm{binomial}(n; m, r) \sum_{i=(n+1)/2}^{n} \mathrm{binomial}(i; n, p) \]

A result which may be easily obtained by expansion is: 

\[ G(m, r, 1 - p) = 1 - G(m, r, p) \tag{6} \]

Using the assumptions in section 1.2 and the uniform bin probabilities we can now proceed
to simplify:

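\[
\begin{aligned}
\mathrm{EA} &= \sum_{b \in B} \frac{1}{|B|} \sum_{n=0}^{m} \mathrm{binomial}(n; m, \tfrac{1}{|B|}) \sum_{v \in V} \Pr(v \mid b) \sum_{i=(n+1)/2}^{n} \mathrm{binomial}(i; n, \Pr(v \mid b)) \\
&= \sum_{b \in B} \frac{1}{|B|} \sum_{v \in V} \Pr(v \mid b) \sum_{n=0}^{m} \mathrm{binomial}(n; m, \tfrac{1}{|B|}) \sum_{i=(n+1)/2}^{n} \mathrm{binomial}(i; n, \Pr(v \mid b)) \\
&= \sum_{b \in B} \frac{1}{|B|} \sum_{v \in V} \Pr(v \mid b)\, G(m, \tfrac{1}{|B|}, \Pr(v \mid b)) \\
&= 1 - p + (2p - 1)\, G(m, \tfrac{1}{|B|}, p)
\end{aligned}
\tag{7}
\]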



The last step uses equation (6) and $\sum_{b \in B} \frac{1}{|B|} = 1$.
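
For small corpora, G and the expected accuracy of equation (7) can be evaluated directly from the definitions. The sketch below is a brute-force illustration, not the program described later in the paper. It assumes that ties (including empty bins) count as half a success, which is the even-`n` extra term mentioned above, so that equation (6) holds for every `n`. The names `binomial`, `G` and `expected_accuracy` mirror the paper's notation but are otherwise hypothetical, and because `math.comb` produces exact integers that are converted to floats, the approach is only practical for `m` up to roughly a thousand.

```python
from math import comb


def binomial(i, k, p):
    """binomial(i; k, p): probability of i successes in k trials of probability p."""
    return comb(k, i) * p**i * (1 - p) ** (k - i)


def G(m, r, p):
    """Chance that a bin of probability r learns a value of probability p from m instances."""
    total = 0.0
    for n in range(m + 1):
        strict = sum(binomial(i, n, p) for i in range(n // 2 + 1, n + 1))
        tie = 0.5 * binomial(n // 2, n, p) if n % 2 == 0 else 0.0
        total += binomial(n, m, r) * (strict + tie)
    return total


def expected_accuracy(m, num_bins, p):
    """Equation (7) for uniform bins: 1 - p + (2p - 1) G(m, 1/|B|, p)."""
    return 1 - p + (2 * p - 1) * G(m, 1 / num_bins, p)
```

Calling `expected_accuracy(0, 10, 0.9)` gives 0.5, the accuracy of pure guessing, and the value approaches `p` as `m` grows relative to the number of bins.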
3.2 A Computable Bound for G

The main difficulty with the function $G$ is the appearance of the binomial coefficient $\binom{m}{n}$.
Most corpus-based language learners use large corpora, so we expect the number of training
instances, $m$, to be very large. So we need a more easily computable version of $G$. The
following argument leads to a fairly tight lower bound to $G$ for suitably chosen values of
$k_j$ (see below):

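\[
\begin{aligned}
G(m, r, p) &= \sum_{n=0}^{m} \sum_{i=(n+1)/2}^{n} \mathrm{binomial}(n; m, r)\,\mathrm{binomial}(i; n, p) \\
&= \sum_{j=0}^{(m-1)/2} \sum_{n=2j+1}^{m} \mathrm{binomial}(n; m, r)\,\mathrm{binomial}(n - j; n, p) \\
&= \sum_{j=0}^{(m-1)/2} \sum_{n=2j+1}^{m} \binom{m}{n} r^{n} (1 - r)^{m-n} \binom{n}{j} p^{n-j} (1 - p)^{j} \\
&\ge \sum_{j=0}^{(m-1)/2} \sum_{n=2j+1}^{k_j} \binom{m}{n} r^{n} (1 - r)^{m-n} \binom{n}{j} p^{n-j} (1 - p)^{j}
\end{aligned}
\]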

The first step rearranges the order of addition. The final step introduces a series of variables
$k_j$ which limit the number of terms in the inner sum. The inequality holds for all $k_j \le m$.
Notice that the $k_j$ may vary for each term of the outer sum. Since $n \le k_j \le m$ we can use
the following relation:

\[ \frac{m!}{(m - n)!} \ge (m - k_j)^{n} \tag{8} \]

Letting $x_j = rp(m - k_j)/(1 - r)$ we can simplify as follows:

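\[
\begin{aligned}
G(m, r, p) &\ge \sum_{j=0}^{(m-1)/2} (1 - r)^{m} \left(\frac{1 - p}{p}\right)^{j} \sum_{n=2j+1}^{k_j} \frac{(m - k_j)^{n}}{n!} \left(\frac{rp}{1 - r}\right)^{n} \binom{n}{j} \\
&= \sum_{j=0}^{(m-1)/2} (1 - r)^{m} \left(\frac{1 - p}{p}\right)^{j} \sum_{n=2j+1}^{k_j} \frac{x_j^{n}}{n!} \binom{n}{j} \\
&\ge \sum_{j=0}^{g} (1 - r)^{m} \left(\frac{1 - p}{p}\right)^{j} \sum_{n=2j+1}^{k_j} \frac{x_j^{n}}{n!} \binom{n}{j}
\end{aligned}
\]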

The last step introduces $g$ and holds for all $g \le (m - 1)/2$, since every omitted term is
non-negative. Little is lost by the truncation because in practice only the first few terms of
the outer sum are significant. Thus for suitably chosen $g$, $k_j$ this is a cheaply computable
lower bound for $G$. A program to compute this to a high degree of accuracy has been implemented.
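
The author's program is not reproduced in the paper, so the following is only a sketch of how such a truncated bound might be computed. It assumes the bound has the form sum over j = 0..g of (1 - r)^m ((1 - p)/p)^j times the sum over n = 2j+1..k of x^n / (j! (n - j)!), with x = r p (m - k) / (1 - r) and a single cut-off `k` used for every `j`. That form is a reconstruction from the derivation above, not a quotation, and the name `g_lower_bound` is hypothetical. Terms are accumulated in log space so that large `m` does not overflow.

```python
import math


def g_lower_bound(m, r, p, g, k):
    """Truncated lower bound on G(m, r, p); needs 0 < r < 1, 0 < p < 1 and 0 < k < m."""
    if not 0 < k < m or g < 0:
        raise ValueError("need 0 < k < m and g >= 0")
    log_x = math.log(r * p * (m - k) / (1.0 - r))
    log_base = m * math.log1p(-r)          # log of (1 - r)^m
    log_odds = math.log((1.0 - p) / p)     # log of (1 - p) / p
    total = 0.0
    for j in range(g + 1):
        for n in range(2 * j + 1, k + 1):
            # binom(n, j) / n! simplifies to 1 / (j! (n - j)!).
            log_term = (log_base + j * log_odds + n * log_x
                        - math.lgamma(j + 1) - math.lgamma(n - j + 1))
            total += math.exp(log_term)
    return total
```

The bound only counts strict majorities, and it is tight only when `m` is much larger than `k`, since the step that replaces m!/(m - n)! by (m - k)^n loses a factor of roughly (1 - k/m)^n on each term.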
Figure 1: Simulation results and theoretical predictions for 10000 bins



4 Experiment 

4.1 Skewed Bins 

The assumption that bin probabilities are uniform is problematic. When bins are uniformly
probable, the expected number of training instances in the same bin as a random test instance
is $m/|B|$ ($= \sum_{b \in B} \Pr(b) \sum_{n=0}^{m} n \Pr(n \text{ training items fall into } b)$). But most distributions
in language are highly skewed. Zipf's law states that word types are distributed logarithmically
(the $n$th most frequent word has probability proportional to $1/n$). When this is true
the expected number of training instances in the same bin as a random test instance is
approximately $O(m / (\log |B|)^{2})$. Thus we expect much more information to be available
about typical test cases.
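
The quantity discussed here is the sum over bins of Pr(b) times the expected count m Pr(b), which can be evaluated directly for any bin distribution. The sketch below compares a uniform distribution with a Zipf-like one whose probabilities are proportional to 1/n; the function names are hypothetical and the normalisation of the Zipf weights is an assumption of this example.

```python
def uniform_bins(num_bins):
    """Uniform bin probabilities 1/|B|."""
    return [1.0 / num_bins] * num_bins


def zipf_bins(num_bins):
    """Bin probabilities proportional to 1/n for the nth most frequent bin."""
    weights = [1.0 / n for n in range(1, num_bins + 1)]
    norm = sum(weights)
    return [w / norm for w in weights]


def expected_shared_instances(m, bin_probs):
    """Expected number of training instances sharing a bin with a random test instance."""
    # A test instance lands in bin b with probability q, and that bin
    # is expected to hold m * q of the m training instances.
    return m * sum(q * q for q in bin_probs)
```

For 10,000 bins and 40,000 training instances the uniform case gives 4 instances per test case, and the Zipf-like case always gives at least as many, since the sum of squared probabilities is smallest for the uniform distribution.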

4.2 Simulations 

Since the mathematics in section 3 cannot easily be generalised to different distributions, I
have conducted several simulations in order to verify the mathematical results above and to
explore the effect of using a skewed distribution of bins.

These simulations use a fixed number of bins (10,000), allocating m training instances
to the bins according to either a uniform or logarithmic distribution. Each simulation then
measures the correctness of the mode-based learner on 1000 randomly generated test instances
to arrive at an observed correctness rate.^3

This process (training and testing) is repeated 30 times for each run, with the mean being 
recorded as the observed accuracy. The standard deviation is used to estimate a 5% t-score 
confidence interval. 
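
The simulation procedure can be sketched as follows. This is not the author's code: the function names, the seeding, and the convention that value 1 is the most likely value in every bin are all assumptions made for illustration, and the sketch reports only the mean and standard deviation of the repeated runs, leaving the confidence interval computation aside.

```python
import random
import statistics


def simulate_once(bin_probs, m, p, num_test, rng):
    """Train the mode-based learner on m instances and return accuracy on num_test."""
    bins = range(len(bin_probs))
    counts = [[0, 0] for _ in bin_probs]  # per-bin counts of values 0 and 1
    for b in rng.choices(bins, weights=bin_probs, k=m):
        counts[b][1 if rng.random() < p else 0] += 1
    correct = 0
    for b in rng.choices(bins, weights=bin_probs, k=num_test):
        zeros, ones = counts[b]
        if ones != zeros:
            guess = 1 if ones > zeros else 0
        else:
            guess = rng.randrange(2)  # empty bin or tie: random guess
        truth = 1 if rng.random() < p else 0
        correct += guess == truth
    return correct / num_test


def run_simulation(bin_probs, m, p=0.9, num_test=1000, repeats=30, seed=0):
    """Repeat training and testing; return mean and standard deviation of accuracy."""
    rng = random.Random(seed)
    accuracies = [simulate_once(bin_probs, m, p, num_test, rng) for _ in range(repeats)]
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```

A uniform run over 10,000 bins would be `run_simulation([1 / 10000] * 10000, m)`, and a logarithmic run would pass probabilities proportional to 1/n instead.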

^3 The results were generated using an optimal value probability of p = 0.9 (thus the optimal accuracy rate
is 90%). Simulations with other values of p did not differ qualitatively.
4.3 Results



Figure 1 shows five traces of accuracy as the volume of training data is varied. The lowest
curve shows the old bound which can be achieved using the results in Lauer (1995), as
represented by equation (5). The other dotted curve shows the expected accuracy predicted
using equation (7) as approximated by the program described in section 3.2. The two further
curves (with confidence interval bars) then show the results of simulations, using uniform and
logarithmic bin distributions.

As can be seen, the new bound given in this paper is accurate for uniform bin probabilities.
However, when the bins are logarithmically distributed learning converges significantly more
quickly, as suggested by the reasoning about expected number of relevant training instances
(see section 4.1). Perhaps surprisingly though, the logarithmic distribution appears to
eventually fall behind the uniform one once there is plenty of data. This might be explained by
the presence of very rare bins in the logarithmic distribution which thus take longer to learn.
Both these observations are crucial to reasoning about data requirements for SLL.



5 Conclusion 

If commercial NLP systems are to be developed from the current batch of research prototypes 
for SLL, then a predictive theory of the data requirements of such systems is necessary. In this 
paper I have explored the dependence of the expected accuracy of a simple statistical learner 
on the volume of training data. When the probability distribution of inputs is uniform, I 
have shown how to compute the expected accuracy, a result backed up by simulations. In 
particular, an average of four training instances per bin can be expected to yield an error rate 
only 50% worse than the optimal error rate. 

When the distribution is non-uniform, simulations show that convergence can be much 
more rapid. Error rates only 50% worse than optimal result from only three training instances 
per bin. However, when data is abundant, non-uniform distributions result in higher error 
rates than the estimate produced by assuming uniformity. 